Final Project - Analyzing Sales Data
Date: 24th May 2023
Author: Peerapan Nantang
Course: Pandas Foundation
# import data
import pandas as pd
df = pd.read_csv("sample-store.csv")
# preview top 5 rows
df.head()
# shape of dataframe
df.shape
(9994, 21)
# see data frame information using .info()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9994 non-null int64
1 Order ID 9994 non-null object
2 Order Date 9994 non-null object
3 Ship Date 9994 non-null object
4 Ship Mode 9994 non-null object
5 Customer ID 9994 non-null object
6 Customer Name 9994 non-null object
7 Segment 9994 non-null object
8 Country/Region 9994 non-null object
9 City 9994 non-null object
10 State 9994 non-null object
11 Postal Code 9983 non-null float64
12 Region 9994 non-null object
13 Product ID 9994 non-null object
14 Category 9994 non-null object
15 Sub-Category 9994 non-null object
16 Product Name 9994 non-null object
17 Sales 9994 non-null float64
18 Quantity 9994 non-null int64
19 Discount 9994 non-null float64
20 Profit 9994 non-null float64
dtypes: float64(4), int64(2), object(15)
memory usage: 1.6+ MB
We can use the pd.to_datetime() function to convert the 'Order Date' and 'Ship Date' columns to datetime.
# example of pd.to_datetime() function
pd.to_datetime(df['Order Date'].head(), format='%m/%d/%Y')
# TODO - convert order date to datetime in the original dataframe
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')
df['Order Date']
# TODO - convert ship date to datetime in the original dataframe
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')
df['Ship Date']
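As an aside (not part of the original assignment): with a fixed `format=`, pd.to_datetime raises on any string that doesn't match. A minimal self-contained sketch of the `errors='coerce'` option, which turns unparseable values into NaT so bad rows are easy to find; the sample strings are made up for illustration:

```python
import pandas as pd

# One valid date and one garbage value, in the same %m/%d/%Y format as the dataset
raw = pd.Series(['01/15/2017', 'not a date'])

# errors='coerce' maps anything unparseable to NaT instead of raising
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')
print(parsed.isna().sum())  # 1 coerced value
```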
# TODO - count nan in postal code column
df['Postal Code'].isna().sum()
11
# TODO - filter rows with missing values
## dropna(subset=['Postal Code']) would target that column directly;
## a bare dropna() gives the same result here because Postal Code is the only column with NaN
df = df.dropna()
df
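To illustrate the difference between the two dropna calls above, a small self-contained sketch on a toy frame (the values are made up) where, as in the dataset, only one column contains NaN:

```python
import pandas as pd
import numpy as np

# Toy frame mirroring the dataset: NaN appears only in 'Postal Code'
toy = pd.DataFrame({
    'Postal Code': [90210.0, np.nan, 10001.0],
    'Sales': [100.0, 200.0, 300.0],
})

# subset=... drops rows only when the listed columns are NaN;
# a bare dropna() drops rows with NaN in ANY column
subset_drop = toy.dropna(subset=['Postal Code'])
full_drop = toy.dropna()
print(len(subset_drop), len(full_drop))  # same here: 2 2
```

The two agree here only because 'Postal Code' is the sole gappy column; with NaN scattered across several columns, the subset form is the safer choice.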
# TODO - Explore this dataset on your own, ask your own questions
Data Analysis Part
Answer the 10 questions below to get credit for this course. Write pandas code to find the answers.
# TODO 01 - how many columns, rows in this dataset
df.shape
# Row = 9983 rows
# Column = 21 columns
(9983, 21)
# TODO 02 - are there any missing values? If so, in which column, and how many NaN values?
df.info()
# There are no missing values left: the rows with a missing Postal Code were dropped earlier.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9983 entries, 0 to 9993
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Row ID 9983 non-null int64
1 Order ID 9983 non-null object
2 Order Date 9983 non-null datetime64[ns]
3 Ship Date 9983 non-null datetime64[ns]
4 Ship Mode 9983 non-null object
5 Customer ID 9983 non-null object
6 Customer Name 9983 non-null object
7 Segment 9983 non-null object
8 Country/Region 9983 non-null object
9 City 9983 non-null object
10 State 9983 non-null object
11 Postal Code 9983 non-null float64
12 Region 9983 non-null object
13 Product ID 9983 non-null object
14 Category 9983 non-null object
15 Sub-Category 9983 non-null object
16 Product Name 9983 non-null object
17 Sales 9983 non-null float64
18 Quantity 9983 non-null int64
19 Discount 9983 non-null float64
20 Profit 9983 non-null float64
dtypes: datetime64[ns](2), float64(4), int64(2), object(13)
memory usage: 1.7+ MB
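Besides scanning the Non-Null Count column of `.info()`, `isna().sum()` gives the missing count per column directly. A self-contained sketch on a toy frame (the values are made up):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'Postal Code': [90210.0, np.nan],
    'Sales': [100.0, 200.0],
})

# Per-column count of missing values, as a Series indexed by column name
missing = toy.isna().sum()
print(missing['Postal Code'], missing['Sales'])  # 1 0
```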
# TODO 03 - your friend asks for the `California` data, filter it and export a CSV for him
## filter `California` data
cal_data = df[df['State'] == 'California']
## export csv
cal_data.to_csv('california_data.csv', index=False)
# TODO 04 - your friend asks for all order data in `California` and `Texas` in 2017 (look at Order Date), send him a CSV file
## filter rows where the 'State' column is 'California' or 'Texas' and the 'Order Date' falls within 2017
cal_tex_mask = (df['State'].isin(['California', 'Texas'])) & (df['Order Date'].dt.year == 2017)
ct_2017 = df[cal_tex_mask]
ct_2017
## export the filtered data to a CSV file
ct_2017.to_csv('california_texas_data.csv', index=False)
# TODO 05 - how much total sales, average sales, and standard deviation of sales your company make in 2017
sales_2017 = df[df['Order Date'].dt.year == 2017]['Sales']
total_sales = sales_2017.sum()
average_sales = sales_2017.mean()
std_sales = sales_2017.std()
print(f'Total sales in 2017: {total_sales:.2f}')
print(f'Average sales in 2017: {average_sales:.2f}')
print(f'Standard deviation of sales in 2017: {std_sales:.2f}')
Total sales in 2017: 484247.50
Average sales in 2017: 242.97
Standard deviation of sales in 2017: 754.05
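The three statistics above can also be computed in one pass with `.agg()`. A self-contained sketch on toy sales figures (made up, not the 2017 numbers):

```python
import pandas as pd

sales = pd.Series([100.0, 200.0, 300.0])

# One call computes all three statistics, returned as a labeled Series
stats = sales.agg(['sum', 'mean', 'std'])
print(f"total={stats['sum']:.2f} mean={stats['mean']:.2f} std={stats['std']:.2f}")
# total=600.00 mean=200.00 std=100.00
```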
# TODO 06 - which Segment has the highest profit in 2018
profit_2018 = df[df['Order Date'].dt.year == 2018].groupby('Segment')['Profit'].sum()
profit_2018
highest_profit_segment = profit_2018.idxmax()
highest_profit_segment
'Consumer'
highest_profit = profit_2018.max()
highest_profit
28281.3665
print(f'The {highest_profit_segment} segment had the highest profit in 2018 with a total profit of {highest_profit:.2f}')
The Consumer segment had the highest profit in 2018 with a total profit of 28281.37
# TODO 07 - which top 5 States have the least total sales between 15 April 2019 - 31 December 2019
start_date = '2019-04-15'
end_date = '2019-12-31'
date_range = (df['Order Date'] >= start_date) & (df['Order Date'] <= end_date)
sales = df[date_range].groupby('State')['Sales'].sum()
least_sales_states = sales.nsmallest(5)
print('The top 5 states with the least total sales between 15 April 2019 and 31 December 2019 are:')
print(least_sales_states)
The top 5 states with the least total sales between 15 April 2019 and 31 December 2019 are:
State
New Hampshire 49.05
New Mexico 64.08
District of Columbia 117.07
Louisiana 249.80
South Carolina 502.48
Name: Sales, dtype: float64
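The two chained comparisons above have an inclusive-on-both-ends shorthand, `Series.between()`. A self-contained sketch with made-up dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-03-01', '2019-06-01', '2020-01-15']))

# between() is inclusive of both endpoints, matching >= start and <= end
mask = dates.between('2019-04-15', '2019-12-31')
print(list(mask))  # [False, True, False]
```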
# TODO 08 - what is the proportion of total sales (%) in West + Central in 2019 e.g. 25%
west_central_mask = (df['Order Date'].dt.year == 2019) & (df['Region'].isin(['West', 'Central']))
west_central_mask
west_central_sales = df[west_central_mask]['Sales'].sum()
west_central_sales
334909.5525
total_sales = df[df['Order Date'].dt.year == 2019]['Sales'].sum()
total_sales
606238.5380000001
proportion = (west_central_sales / total_sales) * 100
print(f'The proportion of total sales in West and Central regions in 2019 is: {proportion:.2f}%')
The proportion of total sales in West and Central regions in 2019 is: 55.24%
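An alternative route to the same kind of answer is to compute every region's share at once with a groupby. A self-contained sketch on made-up regional figures:

```python
import pandas as pd

toy = pd.DataFrame({
    'Region': ['West', 'Central', 'East', 'South'],
    'Sales': [300.0, 200.0, 400.0, 100.0],
})

# Per-region share of total sales, in percent
share = toy.groupby('Region')['Sales'].sum() / toy['Sales'].sum() * 100
print(round(share['West'] + share['Central'], 2))  # 50.0
```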
# TODO 09 - find top 10 popular products in terms of number of orders vs. total sales during 2019-2020
subset = (df['Order Date'].dt.year >= 2019) & (df['Order Date'].dt.year <= 2020)
group = df[subset].groupby('Product Name').agg({'Order ID': 'nunique', 'Sales': 'sum'})
group
top_10_orders = group.nlargest(10, 'Order ID')
top_10_sales = group.nlargest(10, 'Sales')
print('Top 10 popular products in terms of number of orders during 2019-2020:')
print(top_10_orders['Order ID'])
print('\nTop 10 popular products in terms of total sales during 2019-2020:')
print(top_10_sales['Sales'])
Top 10 popular products in terms of number of orders during 2019-2020:
Product Name
Easy-staple paper 27
Staples 24
Staple envelope 22
Staples in misc. colors 13
Chromcraft Round Conference Tables 12
Staple remover 12
Storex Dura Pro Binders 12
Avery Non-Stick Binders 11
Global Wood Trimmed Manager's Task Chair, Khaki 11
GBC Instant Report Kit 10
Name: Order ID, dtype: int64
Top 10 popular products in terms of total sales during 2019-2020:
Product Name
Canon imageCLASS 2200 Advanced Copier 61599.824
Hewlett Packard LaserJet 3310 Copier 16079.732
3D Systems Cube Printer, 2nd Generation, Magenta 14299.890
GBC Ibimaster 500 Manual ProClick Binding System 13621.542
GBC DocuBind TL300 Electric Binding System 12737.258
GBC DocuBind P400 Electric Binding System 12521.108
Samsung Galaxy Mega 6.3 12263.708
HON 5400 Series Task Chairs for Big and Tall 11846.562
Martin Yale Chadless Opener Electric Letter Opener 11825.902
Global Troy Executive Leather Low-Back Tilter 10169.894
Name: Sales, dtype: float64
# TODO 10 - plot at least 2 plots, any plot you think interesting :)
## Bar plot of total sales by category
import matplotlib.pyplot as plt
import seaborn as sns
sales_by_category = df.groupby('Category')['Sales'].sum()
sns.barplot(x=sales_by_category.index, y=sales_by_category.values)
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Total Sales')
plt.show()
## Line plot of monthly sales over time
## 'Order Date' is already datetime from the earlier conversion, so this line is a safe no-op
df['Order Date'] = pd.to_datetime(df['Order Date'])
monthly_sales = df.resample('M', on='Order Date')['Sales'].sum()  # note: newer pandas (>= 2.2) prefers 'ME' for month-end
monthly_sales.plot()
plt.title('Monthly Sales Over Time')
plt.xlabel('Order Date')
plt.ylabel('Total Sales')
plt.show()
## Heatmap of sales by region and category
sales_by_region_category = df.pivot_table(index='Region', columns='Category', values='Sales', aggfunc='sum')
sns.heatmap(sales_by_region_category, annot=True, fmt='.2f')
plt.title('Sales by Region and Category')
plt.show()
# TODO Bonus - use np.where() to create new column in dataframe to help you answer your own questions
import numpy as np
df['High Sales'] = np.where(df['Sales'] > 500, 'Yes', 'No')
df['High Sales']
## How many orders had high sales?
df['High Sales'].value_counts()
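np.where handles two buckets; for more tiers, np.select generalizes it. A self-contained sketch on made-up sales values, with illustrative thresholds that are not from the course:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sales': [100.0, 600.0, 2000.0]})

# np.select takes parallel lists of conditions and choices;
# the first matching condition wins, else the default applies
conditions = [toy['Sales'] > 1000, toy['Sales'] > 500]
choices = ['Very High', 'High']
toy['Sales Tier'] = np.select(conditions, choices, default='Normal')
print(list(toy['Sales Tier']))  # ['Normal', 'High', 'Very High']
```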